Machine translation

ABSTRACT

A computer natural language translation system, comprising: means for inputting source language text; means for outputting target language text; transfer means for generating said target language text from said source language text using stored translation data generated from examples of source and corresponding target language texts, in which said stored translation data comprises a plurality of translation units each consisting of an aligned language unit (e.g. word). This invention generates the translation units for the translation system from a new source-target translation pair of examples, by generating source and target analyses and then finding the alignments by scoring and matching.

This invention relates to machine translation. More particularly, this invention relates to example-based machine translation. Machine translation is a branch of language processing.

In most machine translation systems, a linguist assists in the writing of a series of rules which relate to the grammar of the source language (the language to be translated from) and the target language (the language to be translated to) and transfer rules for transferring data corresponding to the source text into data corresponding to the target text. In the classical “transfer” architecture, the source grammar rules are first applied to remove the syntactic dependence of the source language and arrive at something closer to the semantics (the meaning) of the text, which is then transferred to the target language, at which point the grammar rules of the target language are applied to generate syntactically correct target language text.

However, hand-crafting rules for such systems is expensive, time consuming and error prone. One approach to reducing these problems is to take examples of source language texts and their translations into target languages, and to attempt to extract suitable rules from them. In one approach, the source and target language example texts are manually marked up to indicate correspondences.

Prior work in this field is described in, for example, Brown P F, Cocke J, della Pietra S A, della Pietra V J, Jelinek F, Lafferty J D, Mercer R L and Roossin P S 1990, ‘A Statistical Approach to Machine Translation’, Computational Linguistics, 16 2 pp. 79-85; Berger A, Brown P, della Pietra S A, della Pietra V J, Gillett J, Lafferty J, Mercer R, Printz H and Ures L 1994, ‘Candide System for Machine Translation’, in Human Language Technology: Proceedings of the ARPA Workshop on Speech and Natural Language; Sato S and Nagao M 1990, ‘Towards Memory-based Translation.’, in COLING '90; Sato S 1995, ‘MBT2: A Method for Combining Fragments of Examples in Example-based Translation’, Artificial Intelligence, 75 1 pp. 31-49; Güvenir H A and Cicekli I 1998, ‘Learning Translation Templates from Examples’, Information Systems, 23 6 pp. 353-636; Watanabe H 1995, ‘A Model of a Bi-Directional Transfer Mechanism Using Rule Combinations’, Machine Translation, 10 4 pp. 269-291; Al-Adhaileh M H and Kong T E, ‘A Flexible Example-based Parser based on the SSTC’, in Proceedings of COLING-ACL '98, pp. 687-693.

Our earlier European application No. 01309152.5, filed on 29 Oct. 2001, Agents Ref: J00043743EP, Clients Ref: A26213, describes a machine translation system in which example source and target translation texts are manually marked up to indicate dependency (for which, see Mel'cuk I A 1988, Dependency Syntax: theory and practice, State University of New York Albany) and alignment between words which are translations of each other. The system described there then decomposes the source and target texts into smaller units by breaking the texts up at the alignments. The translation units represent small corresponding phrases in the source and target languages. Because they are smaller than the original text, they are more general. The translation system can then make use of the translation units to translate new source language texts which incorporate the translation units in different combinations to those in the example texts from which they were derived.

Our earlier European applications 01309153.3, filed 29 Oct. 2001, Agents Ref: J00043744EP, Clients Ref: A26214, and 01309156.6, filed 29 Oct. 2001, Agents Ref: J00043742EP, Clients Ref: A26211, describe improvements on this technique. All three of these applications are incorporated herein in their entirety by reference.

Our earlier applications described manual alignments of words in the source and target languages. In most other proposed systems, manual alignment is performed, although lexical alignment is sometimes done automatically (see Brown P F, Cocke J, della Pietra S A, della Pietra V J, Jelinek F, Lafferty J D, Mercer R L and Roossin P S 1990, ‘A Statistical Approach to Machine Translation’, Computational Linguistics, 16 2 pp. 79-85 and Güvenir H A and Cicekli I 1998, ‘Learning Translation Templates from Examples’, Information Systems, 23 6 pp. 353-636).

An aim of the present invention is to provide an automatic system for obtaining translation units for use in subsequent translation, for example for systems as described in our above referenced earlier European applications.

The present invention is defined in the claims appended hereto, with advantages, preferred features and embodiments which will be apparent from the description, claims and drawings.

It may advantageously be used together with the invention described in our European application EP 02 252 326 filed on the same day (28 Mar. 2002) and through the same office as this application, agent's reference J00044152EP, applicant's reference A30154.

The invention is generally applicable to methods of machine translation. Embodiments of the invention are able to generalise from a relatively small number of examples of text, and this allows such embodiments to be used with the text held in, for example, a translation memory as described by Melby A K and Wright S E 1999, ‘Leveraging Terminological Data For Use In Conjunction With Lexicographical Resources’, in Proceedings of the 5th International Congress on Terminology and Knowledge Representation, pp. 544-569.

Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:

FIG. 1 is a block diagram showing the components of a computer translation system according to a first embodiment;

FIG. 2 is a block diagram showing the components of a computer forming part of FIG. 1;

FIG. 3 is a diagram showing the programs and data present within the computer of FIG. 2;

FIG. 4 is an illustrative diagram showing the stages in translation of text according to the present invention;

FIG. 5 is a flow diagram showing an annotation process performed by the apparatus of FIG. 1 to assist a human user in marking up example texts;

FIG. 6 shows a screen produced during the process of FIG. 5 to allow editing;

FIG. 7 is a flow diagram giving a schematic overview of the subsequent processing steps performed in a first embodiment to produce data for subsequent translation;

FIG. 8 shows a screen display produced by the process of FIG. 5 illustrating redundant levels;

FIG. 9 is a flow diagram illustrating the process for eliminating the redundant levels of FIG. 8; and

FIG. 10 illustrates a structure corresponding to that of FIG. 8 after the performance of the process of FIG. 9;

FIG. 11 shows the dependency graph produced by the process of FIG. 5 for a source text (in English) which contains a relative clause;

FIG. 12 is a flow diagram showing the process performed by the first embodiment on encountering such a relative clause; and

FIG. 13 corresponds to FIG. 11 and shows the structure produced by the process of FIG. 12;

FIG. 14 shows the structure produced by the process of FIG. 5 for a source text which includes a topic shifted phrase;

FIG. 15 is a flow diagram showing the process performed by the first embodiment in response to a topic shifted phrase; and

FIG. 16 corresponds to FIG. 14 and shows the structure produced by the process of FIG. 15;

FIG. 17 is a flow diagram showing an overview of the translation process performed by the embodiment of FIG. 1;

FIG. 18 (comprising FIGS. 18 a and 18 b) is a flow diagram showing in more detail the translation process of the first embodiment;

FIGS. 19 a-19 f show translation components used in a second embodiment of the invention to generate additional translation components for generalisation;

FIG. 20 is a flow diagram showing the process by which such additional units are created in the second embodiment;

FIG. 21 is a flow diagram showing the first stage of the process of generating restrictions between possible translation unit combinations according to a third embodiment;

FIG. 22 is a flow diagram showing the second stage in the process of the third embodiment;

FIG. 23 (comprising FIGS. 23 a and 23 b) is a flow diagram showing the third stage in the process of the third embodiment;

FIG. 24 is a flow diagram showing the operation of a preferred embodiment of the invention in generating new translation units;

FIG. 25 (comprising FIGS. 25 a, 25 b and 25 c) is a flow diagram showing the process of word match scoring comprising part of the process of FIG. 24; and

FIG. 26 is a flow diagram showing the process of word alignment and scoring forming part of the process of FIG. 24.

FIRST EMBODIMENT

FIG. 1 shows apparatus suitable for implementing the present invention. It consists of a work station 100 comprising a keyboard 102, computer 104 and visual display unit 106. For example, the work station 100 may be a high performance personal computer or a Sun work station.

FIG. 2 shows the components of a computer 104 of FIG. 1, comprising a CPU 108 (which may be a Pentium III or reduced instruction set (RISC) processor). Connected to the CPU are a peripheral chip set 112 for communicating with the keyboard, VDU and other components; a memory 114 for storing executing programs and working data; and a store 110 storing programs and data for subsequent execution. The store 110 comprises a hard disk drive; if the hard disk drive is not removable then the store 110 also comprises a removable storage device such as a floppy disk drive to allow the input of stored text files.

FIG. 3 illustrates the programs and data held on the store 110 for execution by the CPU 108. They comprise a development program 220 and a translation program 230.

The development program comprises a mapping program 222 operating on a source text file 224 and a target text file 226. In this embodiment, it also comprises a source lexicon 234 storing words of the source language together with data on their syntactic and semantic properties, and a target language lexicon 236 storing similar information from the target language, together with mapping data (such as the shared identifiers of the Eurowordnet Lexicon system) which link source and target words which are translations of each other.

The translation program comprises a translation data store 232, which stores translation data in the form of PROLOG rules defined by the relationships established by the mapping program 222. A translation logic program 238 (for example a PROLOG program) defines the steps to be taken by the translation program using the rules 232, and a logic interpreter program 239 interprets the translation logic and rules into code for execution by the CPU 108.

Finally, an operating system 237 provides a graphic user interface, input/output functions and other well known functions. The operating system may, for example, be Microsoft Windows™, or Unix or Linux operating in conjunction with X-Windows.

FIG. 4 is an overview of the translation process. Source language text (A) is parsed to provide data representing a source surface tree (B) corresponding to data defining a source dependency structure (C), which is associated with a target dependency structure (D). The target dependency structure is then employed to generate a target surface tree (E) structure, from which target language text (F) is generated.

These steps will be discussed in greater detail below. First, however, the process performed by the development program 220 in providing the data for use in subsequent translations will be discussed.

Development Program

Referring to FIG. 5, in a step 402, the mapping program 222 creates a screen display (shown in FIG. 6) comprising the words of a first sentence of the source document and the corresponding sentence of the translation document (in this case, the source document has the sentence “I like to swim” in English, and the target document has the corresponding German sentence “Ich schwimme gern”). Each word is displayed within a graphic box 1002-1008, 1010-1014. The mapping program allows the user to move the words vertically, but not to change their relative horizontal positions (which correspond to the actual orders of occurrence of the words in the source and target texts).

The user (a translator or linguist) can then draw (using the mouse or other cursor control device) dependency relationship lines (“links”) between the boxes containing the words. In this case, the user has selected “swim” (1008) as the “head” word in the English text and “I” (1002), “like” (1004) and “to” (1006) as the “daughters” by drawing dependency lines from the head 1008 to each of the daughters 1002-1006.

At this point, it is noted that all of the daughters 1002-1006 in the source language in this case lie to the left of the head 1008; they are termed “left daughters”. One of the heads is marked as the surface root of the entire sentence (or, in more general terms, block of text).

In the target language text of FIG. 6, it will be seen that “Ich” (1010) lies to the left of “schwimme” (1012) and is therefore a “left daughter”, whereas “gern” (1014) lies to the right and is therefore a “right daughter”. Left and right daughters are not separately identified in the dependency graphs but will be stored separately in the surface graphs described below.
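By way of illustration only, the left/right distinction can be derived from word positions alone. The following Python sketch is not part of the described embodiment (which uses PROLOG); the function name `split_daughters` and the use of zero-based word positions are assumptions made for the example.

```python
def split_daughters(head_position, daughter_positions):
    """Split daughters into those left and right of the head in surface order."""
    left = sorted(p for p in daughter_positions if p < head_position)
    right = sorted(p for p in daughter_positions if p > head_position)
    return left, right

# "Ich schwimme gern": Ich=0, schwimme=1 (head), gern=2
german_left, german_right = split_daughters(1, [0, 2])

# "I like to swim": I=0, like=1, to=2, swim=3 (head)
english_left, english_right = split_daughters(3, [0, 1, 2])
```

In the German example “Ich” falls in the left list and “gern” in the right list; in the English example all three daughters are left daughters.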

The editing of the source graph (step 404) continues until the user has linked all words required (step 406). The process is then repeated (steps 408, 410, 412) for the target language text (1012-1014).

Once the dependency graphs have been constructed for the source and target language texts, in step 414 the program 222 allows the user to provide connections between words in the source and target language texts which can be paired as translations of each other. In this case, “I” (1002) is paired with “Ich” (1010) and “swim” (1008) with “schwimme” (1012).

Not every word in the source text is directly translatable by a word in the target text, and the user will connect only words which are a good direct translation of each other. In slightly more general terms, words may occasionally be connected if they are at the heads of a pair of phrases which are direct translations, even if the connected words themselves are not.

However, it is generally the case in this embodiment that the connection (alignment) indicates not only that phrases below the word (if any) are a translation pair but that the head words themselves also form such a pair.

When the user has finished (step 416), it is determined whether further sentences within the source and target language files remain to be processed and, if not, the involvement of the user ends and the user interface is closed. If further sentences remain, then the next sentence is selected (step 420) and the process resumes at step 402. At this stage, the data representing the translation examples now consists of a set of nodes, some of which are aligned (connected) with equivalents in the other language; translation unit records; and links between them to define the graph.

The present invention also provides for automatic alignment of the source and target language graphs, as will be disclosed in greater detail below.

Processing the Example Graph Structure Data

Referring to FIG. 7, the process performed in this embodiment by the development program 220 is as follows. In step 502, a dependency graph (i.e. the record relating to one of the sentences) is selected, and in step 504, redundant structure is removed (see below).

In step 510, a relative clause transform process (described in greater detail below) is performed. This is achieved by making a copy of the dependency graph data already generated, and then transforming the copy. The result is a tree structure.

In step 550, a topic shift transform process is performed (described in greater detail below) on the edited copy of the graph. The result is a planar tree retaining the surface order of the words, and this is stored with the original dependency graph data in step 580.

Finally, in step 590, each graph is split into separate graph units. Each graph unit record consists of a pair of head words in the source and target languages, together with, for each, a list of right daughters and a list of left daughters (as defined above) in the surface tree structure, and a list of daughters in the dependency graph structure. In step 582, the next dependency graph is selected, until all are processed.

Removal of Redundant Layers

Step 504 will now be discussed in more detail. FIG. 8 illustrates the marked up dependency graph for the English phrase “I look for the book” and the French translation “Je cherche le livre”.

In the English source text, the word “for” (1106) is not aligned with a word in French target text, and therefore does not define a translatable word or phrase, in that there is no subset of words that “for” dominates (including itself) that is a translation of a subset of words in the target language. Therefore, the fact that the word “for” dominates “book” does not assist in translation.

In this embodiment, therefore, the superfluous structure represented by “for” between “look” 1104 and “book” 1110 is eliminated. These modifications are performed directly on the dependency data, to simplify the dependency graph.

Referring to FIGS. 9 and 10, in step 505, a “leaf” node (i.e. hierarchically lowest) is selected and then in step 506, the next node above is accessed. If this is itself a translation node (step 507), then the process returns to step 505 to read the next node up again.

If the node above is not a translation node (step 507) then the next node up again is read (step 508). If that is a translation node (step 509), then the original node selected in step 505 is unlinked and re-attached to that node (step 510). If not, then the next node up again is read (step 508) until a translation node is reached. This process is repeated for each of the nodes in turn, from the “leaf” nodes up the hierarchy, until all are processed. FIG. 10 shows the link between nodes 1106 and 1110 being replaced by a link from node 1104 to node 1110.
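The re-attachment just described may be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the embodiment's own code: the graph is represented as a hypothetical child-to-head mapping, the heads assumed for “I” and “the” are guesses at the FIG. 8 graph, and a node with no aligned ancestor is simply left attached to the topmost node reached.

```python
def remove_redundant_layers(parent, aligned):
    """Re-attach every node to its nearest aligned (translation) ancestor.

    parent:  dict mapping each non-root node to its head
    aligned: set of nodes that are aligned with a target-language node
    """
    simplified = {}
    for node, head in parent.items():
        # Climb past heads that are not translation nodes.
        while head not in aligned and head in parent:
            head = parent[head]
        simplified[node] = head
    return simplified

# "I look for the book": "for" is not aligned with any French word.
FIG8_PARENT = {"I": "look", "for": "look", "book": "for", "the": "book"}
FIG8_ALIGNED = {"I", "look", "the", "book"}

simplified = remove_redundant_layers(FIG8_PARENT, FIG8_ALIGNED)
```

After the call, “book” is attached directly to “look”, corresponding to the replacement of the link 1106-1110 by the link 1104-1110, while “for” remains a daughter of “look”.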

The removal of this redundant structure greatly simplifies the implementation of the translation system, since as discussed below each translation component can be made to consist of a head and its immediate descendants for the source and target sides. There are no intermediate layers. This makes the translation components look like aligned grammar rules (comparable to those used in the Rosetta system), which means that a normal parser program can be used to perform the source analysis and thereby produce a translation.

Producing A Surface Tree

The next step performed by the development program 220 is to process the dependency graphs derived above to produce an associated surface tree. The dependency graphs shown in FIG. 6 are already in the form of planar trees, but this is not invariably the case.

The following steps will use the dependency graph to produce a surface tree structure, by making and then transforming a copy of the processed dependency graph information derived as discussed above.

Relative Clause Transformation (“Relativisation”)

FIG. 11 shows the dependency graph which might be constructed by the user for the phrase “I know the cat that Mary thought John saw” in English, consisting of nodes 1022-1038. In a relative clause such as that of FIG. 11, the dependency graph will have more than one root, corresponding to the main verb (“know”) and the verbs of dependent clauses (“thought”). The effect is that the dependency graph is not a tree, by virtue of having two roots, and because “cat” (1028) is dominated by two heads (“know” (1024) and “saw” (1038)).

Referring to FIGS. 12 and 13, and working on the assumption that the dependency graphs comprise a connected set of trees (one tree for each clause) joined by sharing common nodes, of which one is the principal tree, an algorithm for transforming the dependency graph into a tree is then:

-   Start with the principal root node as the current node.
    -   Mark the current node as ‘processed’.
    -   For each child of the current node,
        -   check whether this child has an unprocessed parent.
            -   For each such unprocessed parent, find the root node that dominates this parent (the subordinate root).
            -   Detach the link by which the unprocessed parent dominates the child and
            -   Insert a link by which the child dominates the subordinate root.
    -   For each daughter of the current node,
        -   make that daughter the current node and continue the procedure until there are no more nodes.

As FIG. 12 shows, in step 512, it is determined whether the last node in the graph has been processed, and, if so, the process ends. If not, then in step 514 the next node is selected and, in step 516, it is determined whether the node has more than one parent. Most nodes will only have one parent, in which case the process returns to step 514.

Where, however, a node such as “cat” (1028) is encountered, which has two parents, the more subordinate tree is determined (step 518) (as that node which is the greater number of nodes away from the root node of the sentence), and in step 520, the link to it (i.e. in FIG. 11, the link between 1038 and 1028) is deleted.

In step 522, a new link is created, from the node to the root of the more subordinate tree. FIG. 13 shows the link now created from “cat” (1028) to “thought” (1034).

The process then returns to step 516, to remove any further links until the node has only one governing node, at which point step 516 causes flow to return to step 514 to process the next node, until all nodes of that sentence are processed.
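The relativisation procedure may be illustrated by the following Python sketch, which follows the general algorithm (principal root first, detaching links from unprocessed parents) rather than the exact flow of FIG. 12. The head-to-daughters dictionary, the use of words as node identifiers and the name `relativise` are assumptions made for the example only.

```python
def relativise(children, principal_root):
    """Turn a multi-rooted dependency graph into a single tree.

    children: dict mapping each head to the list of its daughters
    """
    graph = {head: list(ds) for head, ds in children.items()}  # work on a copy

    def parents_of(node):
        return [head for head, ds in graph.items() if node in ds]

    def root_of(node):
        parents = parents_of(node)
        while parents:
            node = parents[0]
            parents = parents_of(node)
        return node

    processed = set()
    pending = [principal_root]
    while pending:
        current = pending.pop()
        processed.add(current)
        for child in list(graph.get(current, [])):
            for parent in parents_of(child):
                if parent not in processed:
                    subordinate_root = root_of(parent)
                    graph[parent].remove(child)  # detach parent -> child
                    graph.setdefault(child, []).append(subordinate_root)
        pending.extend(graph.get(current, []))
    return graph

# "I know the cat that Mary thought John saw" (FIG. 11)
FIG11 = {
    "know": ["I", "cat"],
    "cat": ["the"],
    "thought": ["that", "Mary", "saw"],
    "saw": ["John", "cat"],
}
tree = relativise(FIG11, "know")
```

For the FIG. 11 graph, the link from “saw” to “cat” is removed and a link from “cat” to “thought” is added, as in FIG. 13, so that every node has at most one head.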

This process therefore has the effect of generating from the original dependency graph an associated tree structure. Thus, at this stage the data representing the translation unit comprises a version of the original dependency graph simplified, together with a transformed graph which now constitutes a tree retaining the surface structure.

Topic Shift Transformation (“Topicalisation”)

The tree of FIG. 13 is a planar tree, but this is not always the case; for example where a phrase (the topic) is displaced from its “logical” location to appear earlier in the text. This occurs, in English, in “Wh-” questions, such as that shown in FIG. 14, showing the question “What did Mary think John saw?” in English, made up of the nodes 1042-1054 corresponding respectively to the words. Although the dependency graph here is a tree, it is not a planar tree because the dependency relationship by which “saw” (1052) governs “what” (1042) violates the projection constraint.

Referring to FIGS. 14 to 16, the topic shift transform stage of step 550 will now be described in greater detail. The algorithm operates on a graph with a tree-topology, and so it is desirable to perform this step after the relativisation transform described above.

The general algorithm is, starting from a “leaf” (i.e. hierarchically lowest) node,

-   For each head (i.e. aligned) word (the current head), identify any daughters that violate the projection (i.e. planarity) constraint (that is, are there intervening words that this word does not dominate either directly or indirectly?)
    -   For each such daughter, remove the dependency relation (link) and attach the daughter to the governing word of the current head.
-   Continue until there are no more violations of the projection constraint.

For each head word until the last (step 552), for the selected head word (step 554), for each link to a daughter node until the last (step 556), a link to a daughter node (left most first) is selected (step 558). The program then examines whether that link violates the planarity constraint, in other words, whether there are intervening words in the word sequence between the head word and the daughter word which are not dominated either directly or indirectly by that head word. If the projection constraint is met, the next link is selected (step 558) until the last (step 556).

If the projection constraint is not satisfied, then the link to the daughter node is disconnected and reattached to the next node up from the current head node, and it is again examined (step 560) whether the planarity constraint is met, until the daughter node has been attached to a node above the current head node where the planarity constraint is not violated.

The next link to a daughter node is then selected (step 558) until the last (step 556), and then the next head node is selected (step 554) until the last (step 552).

Accordingly, after performing the topicalisation transform of FIG. 15, the result is a structure shown in FIG. 16 which is a planar tree retaining the surface structure, and corresponding to the original dependency graph.
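The planarity test and re-attachment may be sketched as follows. This Python sketch is illustrative only: words are identified by their zero-based surface positions, the tree is a hypothetical daughter-to-head mapping, and the names `dominates`, `violates` and `topicalise` are assumptions rather than names used in the embodiment.

```python
def dominates(parent, head, node):
    """True if head dominates node directly or indirectly (or is node itself)."""
    while node is not None:
        if node == head:
            return True
        node = parent.get(node)
    return False

def violates(parent, head, daughter):
    """True if some word lying between head and daughter is not dominated by head."""
    low, high = sorted((head, daughter))
    return any(not dominates(parent, head, w) for w in range(low + 1, high))

def topicalise(parent):
    """Raise offending daughters until no link violates the projection constraint."""
    parent = dict(parent)  # transform a copy, keeping the dependency tree intact
    changed = True
    while changed:
        changed = False
        for daughter in sorted(parent):
            while parent[daughter] in parent and violates(parent, parent[daughter], daughter):
                # Re-attach to the governing word of the current head.
                parent[daughter] = parent[parent[daughter]]
                changed = True
    return parent

# "What did Mary think John saw?": What=0 did=1 Mary=2 think=3 John=4 saw=5
FIG14 = {0: 5, 1: 3, 2: 3, 4: 5, 5: 3}  # "think" (3) is the root
surface = topicalise(FIG14)
```

For FIG. 14, only the link from “saw” to “what” violates the constraint; “what” is re-attached to “think”, giving the planar tree of FIG. 16.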

Splitting the Graphs Into Translation Units

After performing the topicalisation and relativisation transforms, the data record stored comprises, for each sentence, a dependency graph and a surface tree in the source and target languages. Such structures could only be used to translate new text in which those sentences appeared verbatim. It is more useful to split up the sentences into smaller translation component units (corresponding, for example, to short phrases), each headed by a “head” word which is translatable between the source and target languages (and hence is aligned or connected in the source and target graphs).

Accordingly, in step 590, the development program 220 splits each graph into a translation unit record for each of the aligned (i.e. translated) words.

Each translation unit record consists of a pair of head words in the source and target languages, together with, for each, a list of right surface daughters and a list of left surface daughters, and a list of the dependency graph daughters. These lists may be empty. The fields representing the daughters may contain either a literal word (“like” for example) or a placeholder for another translation unit. A record of the translation unit which originally occupied the placeholder (“I” for example) is also retained at this stage. Also provided are a list of the gap stack operations performed for the source and target heads, and the surface daughters.
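One possible shape for such a record is sketched below in Python. The class and field names (`Placeholder`, `Side`, `TranslationUnit` and so on) are hypothetical and chosen for the example; the embodiment stores this information as PROLOG rules, and the gap stack operations are omitted here for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass(frozen=True)
class Placeholder:
    """A slot for another translation unit; `original` records its first occupant."""
    original: str

Daughter = Union[str, Placeholder]  # a literal word or a placeholder

@dataclass
class Side:
    head: str
    left_surface: List[Daughter] = field(default_factory=list)
    right_surface: List[Daughter] = field(default_factory=list)
    dependents: List[Daughter] = field(default_factory=list)

@dataclass
class TranslationUnit:
    source: Side
    target: Side

# Unit headed by "swim"/"schwimme" from "I like to swim" / "Ich schwimme gern".
subject_en = Placeholder(original="I")
subject_de = Placeholder(original="Ich")
swim_unit = TranslationUnit(
    source=Side("swim",
                left_surface=[subject_en, "like", "to"],
                dependents=[subject_en, "like", "to"]),
    target=Side("schwimme",
                left_surface=[subject_de],
                right_surface=["gern"],
                dependents=[subject_de, "gern"]),
)
```

Here the aligned words “I” and “Ich” become placeholders, whereas the unaligned words “like”, “to” and “gern” remain literal daughters.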

The effect of allowing such placeholders is thus that, in a translation unit such as that headed by “swim” in the original sentence above, the place formerly occupied by “I” can now be occupied by another translation unit, allowing it to take part in other sentences such as “red fish swim”. Whereas in a translation system with manually crafted rules the translation units which could occupy each placeholder would be syntactically defined (so as to allow, for example, only a singular noun or noun phrase in a particular place), in the present embodiment there are no such restraints at this stage.

During translation, using PROLOG unification operations, the surface placeholder variables are unified with the dependency placeholders, and any placeholders involved in the gap stack operations. The source dependency placeholders are unified with corresponding target dependency placeholders.

The source surface structures can now be treated as straightforward grammar rules, so that a simple chart parser can be used to produce a surface analysis tree of new texts to be translated, as will be discussed in greater detail below.

It is to be noted that, since the process of producing the surface trees alters the dependencies of daughters upon heads, the lists of daughters within the surface trees will not identically match those within the dependency graphs in every case, since the daughter of one node may have been shifted to another in the surface tree, resulting in it being displaced from one translation unit record to another; the manner in which this is handled is as follows:

Where the result of performing the transformation to derive the surface structure is to displace a node in the surface representation from one translation unit to another, account is taken of this by using a stack or equivalent data structure (referred to in PROLOG as a “gap thread” and simulated using pairs of lists referred to as “threads”).

For translation units where the list of surface daughter nodes contains an extra node relative to the dependency daughters or vice versa (as a result of the transformation process), the translation unit record includes an instruction to pull or pop a term from the stack, and unify this with the term representing the extra dependent daughter.

Conversely, where a translation unit contains an extra surface daughter which does not have an associated dependent daughter term, the record contains an instruction to push a term corresponding to that daughter onto the stack. The term added depends upon whether the additional daughter arose as a result of the topicalisation transform or the relativisation transform.

Thus, in subsequent use in translation, when a surface structure is matched against input source text and contains a term which cannot be accounted for by its associated dependency graph, that term is pushed on to the stack and retrieved to unify with a dependency graph of a different translation unit.

Since this embodiment is written in PROLOG, the relationship between the surface tree, the gap stack and the dependency structure can be expressed simply by variable unification. This is convenient, since the relationship between the surface tree and the dependency structure is thereby completely bi-directional. This enables the relationships used while parsing the source text (or rather, their target text equivalents) to be used in generating the target text. It also ensures that the translation apparatus is bi-directional; that is, it can translate from A to B as easily as from B to A.

Use of a gap stack in similar manner to the present embodiment is described in Pereira F 1981, ‘Extraposition Grammars’, American Journal of Computational Linguistics, 7 4 pp. 243-256, and Alshawi H 1992, The Core Language Engine, MIT Press Cambridge, incorporated herein by reference.

Consider once more the topicalisation transform illustrated by the graphs in FIGS. 14 and 16. The source sides of the translation units that are derived from these graphs are (slightly simplified for clarity),

-   component #0:
    -   head=‘think’
    -   left surface daughters=[‘what’,‘did’,‘mary’],
    -   right surface daughters=[#1]
    -   dependent daughters=[‘did’,‘mary’,#1]
-   component #1:
    -   head=‘saw’,
    -   left surface daughters=[‘john’],
    -   right surface daughters=[ ]
    -   dependent daughters=[‘john’,‘what’]

It can be seen that in component #0 we have ‘what’ in the surface daughters list, but not in the dependent daughters list. Conversely, component #1 has ‘what’ in its dependent daughters list, but not in its surface daughters list.

In component #0, it was the daughter marked #1 that contributed the extra surface daughter when the dependency graph to surface tree mapping took place. So, we wish to add ‘what’ to the gap stack for this daughter. Conversely, in component #1, we need to be able to remove a term from the gap stack that corresponds to the extra dependent daughter (‘what’) in order to be able to use this component at all. Therefore, the head of this component will pop a term off the gap stack, which it will unify with the representation of ‘what’. The modified source side component representations then look like this:

-   component #0:
    -   head=‘think’,
    -   left surface daughters=[‘what’,‘did’,‘mary’],
    -   right surface daughters=[#1:push(Gapstack,‘what’)],
    -   dependent daughters=[‘did’,‘mary’,#1]
-   component #1:
    -   head=‘saw’, pop(Gapstack, ‘what’),
    -   left surface daughters=[‘john’],
    -   right surface daughters=[ ],
    -   dependent daughters=[‘john’,‘what’]
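By way of illustration only, the push and pop operations for the topicalisation example above may be sketched in Python (the embodiment itself uses PROLOG unification, which this sketch does not reproduce). The dictionary layout and the function name `build_dependency` are assumptions of the sketch, and only the topicalisation case is covered.

```python
# Hypothetical data layout: integer daughters refer to other components,
# "push" maps a daughter id to the term pushed onto its gap stack, and
# "pop" marks a head that takes one term off the gap stack.
components = {
    0: {"head": "think", "surface": ["what", "did", "mary", 1], "push": {1: "what"}},
    1: {"head": "saw", "surface": ["john"], "pop": True},
}


def build_dependency(components, cid, stack, out):
    """Derive the dependent daughters of each head from the surface daughters."""
    comp = components[cid]
    push = comp.get("push", {})
    moved = set(push.values())  # surface words handed to a daughter via the stack
    deps = []
    for d in comp["surface"]:
        if isinstance(d, int):  # aligned daughter: another component
            child_stack = stack + [push[d]] if d in push else list(stack)
            build_dependency(components, d, child_stack, out)
            deps.append(components[d]["head"])
        elif d not in moved:  # literal daughter that stays a dependent here
            deps.append(d)
    if comp.get("pop"):
        if not stack:
            raise ValueError("gap stack empty: component cannot be used")
        deps.append(stack[-1])  # the popped term becomes an extra dependent
    out[comp["head"]] = deps
    return out


dependency = build_dependency(components, 0, [], {})
```

In this sketch ‘what’ leaves the dependents of ‘think’ and reappears as a dependent of ‘saw’, matching the dependent daughter lists given above.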

The components for a relativisation transform look a little different. To illustrate this, consider the example in FIGS. 11 and 13. In this example there will be an extra root node in the dependency structure. That means that there will be a component with an extra surface daughter and this surface daughter will cause the head of the component to be pushed onto the gap stack. In this example, ‘cat’ is the head of the relevant component and ‘thought’ is the surface daughter (of ‘cat’) that will push the representation of ‘cat’ onto its gap stack. This will have the effect of disconnecting ‘thought’ in the dependency graph, so making it a root, and making ‘cat’ a dependent daughter of whichever head pops it off the gap stack (in this case ‘saw’).

The representations for the source side of the graphs in FIGS. 11 and 13 are then (again simplified for clarity):

-   component #0:
    -   head=‘know’,
    -   left surface daughters=[‘I’],
    -   right surface daughters=[#1],
    -   dependent daughters=[‘I’,#1]
-   component #1:
    -   head=‘cat’,
    -   left surface daughters=[‘the’],
    -   right surface daughters=[#2:push(Gapstack,‘cat’)],
    -   dependent daughters=[‘the’]
-   component #2:
    -   head=‘thought’,
    -   left surface daughters=[‘that’,‘mary’],
    -   right surface daughters=[#3],
    -   dependent daughters=[‘that’,‘mary’,#3]
-   component #3:
    -   head=‘saw’:pop(Gapstack,X),
    -   left surface daughters=[‘john’],
    -   right surface daughters=[ ],
    -   dependent daughters=[‘john’,X]

This example shows ‘cat’ being added to the gap stack for the daughter #2 of component #1. Also, a term (in this case a variable) is popped off the gap stack at the head of component #3. This term is unified with the dependent daughter of component #3.

Translation

Further aspects of the development program will be considered later.

However, for a better understanding of these aspects, it will be convenient at this stage to introduce a description of the operation of the translation program 230. This will accordingly be discussed.

The source surface structures within the translation components are treated in this embodiment as simple grammar rules so that a surface analysis tree is produced by the use of a simple chart parser, as described for example in James Allen, “Natural Language Understanding”, second edition, Benjamin Cummings Publications Inc., 1995, but modified to operate from the head or root outwards rather than from right to left or vice versa. The parser attempts to match the heads of source surface tree structures for each translation unit against each word in turn of the text to be translated. This produces a database of packed edges using the source surface structures, which is then unpacked to find an analysis.

Unifying the surface tree terms and the dependency tree terms using the stack ensures that the source dependency structure is created at the same time, during unpacking.

Whilst the actual order of implementation of the rules represented by the surface and dependency structures is determined by the logic interpreter 239, FIGS. 17 and 18 notionally illustrate the process.

In a step 602 of FIG. 17, a sentence of the source language file to be translated is selected. In step 610, a source surface tree of a language component is derived using the parser, which reproduces the word order in the input source text. In step 620, the corresponding dependency graph is determined. In step 692, from the source dependency graph, the target dependency graph is determined. In step 694, from the target dependency graph, the target surface tree is determined, and used to generate target language text. In step 696, the target language text is stored. The process continues until the end of the source text (step 698).

FIGS. 18 a and 18 b illustrate steps 610 to 694 in greater detail. In step 603, each surface structure is compared in turn with the input text. Each literal surface daughter node (node storing a literal word) has to match a word in the source text string exactly. Each aligned surface daughter (i.e. surface daughter corresponding to a further translation unit) is unified with the source head record of a translation unit, so as to build a surface tree for the source text. Most possible translation units will not lead to a correct translation. Those for which the list of daughters cannot be matched are rejected as candidates.

Then, for each translation unit in the surface analysis, using the stored stack operations for that unit in the PROLOG unification process, the stack is operated (step 608) to push or pop any extra or missing daughters. If (step 610) the correct number of terms cannot be retrieved for the dependency structure then the candidate structure is rejected and the next is selected, until the last (step 612). Where the correct translation components are present, exactly the correct number of daughters will be passed through the stack.

Where a matching surface and dependency structure (i.e. an analysis of the sentence) is found (step 610), then, referring to FIG. 18 b, for each translation unit in the assembled dependency structure, the corresponding target head nodes are retrieved (step 622) so as to construct the corresponding target dependency structure. The transfer between the source and target languages thus takes place at the level of the dependency structure, and is therefore relatively unaffected by the vagaries of word placement in the source and/or target languages.

In step 626 the stack is operated to push or pop daughter nodes. In step 628, the target surface structure is determined from the target dependency structure.

In step 630, the root of the entire target surface structure is determined by traversing the structure along the links. Finally, in step 632, the target text is recursively generated by traversing the target surface structure from the target surface root component, using PROLOG backtracking if necessary, to extract the target text from the target surface head and daughter components.

SECOND EMBODIMENT Generalisation of Translation Units

Having discussed the essential operation of the first embodiment, further preferred features (usable independently of those described above) will now be described.

Translation units formed by the processes described above consist, for the target and source languages, of a literal head (which is translated) and a number of daughters which may be either literal or non-literal, the latter being variables representing connection points for other translation units. When a translation unit is used, each of the literal daughters has to match the text to be translated exactly and each of the non-literal daughters has to dominate another translation unit.

The set of rules (which is what the translation unit data now comprise) was derived from example text. The derivation will be seen to have taken no account of syntactic or semantic data, except in so far as this was supplied by the human user in marking up the examples. Accordingly, the example of a particular noun with, say, one adjective cannot be used to translate that noun when it occurs with zero, or two or more, adjectives. The present embodiment provides a means of generalising from the examples given. This reduces the number of examples required for an effective translation system or, viewed differently, enhances the translation capability of a given set of examples.

Generalisation is performed by automatically generating new “pseudo translation units”, whose structure is based on the actual translation units derived from marked up examples. Pseudo translation units are added when this reduces the number of distinct behaviours of the set of source-target head pairs. In this case, a ‘behaviour’ is the set of all distinct translation units which have the same source-target head pair.

FIG. 19 (comprising FIGS. 19 a-19 f) shows six example texts of French-English translation pairs; in FIG. 19 a the source head is “car”, with left daughters “the” and “white”, and the target head is “voiture” with left daughter “la” and right daughter “blanche”; similarly FIG. 19 b shows the text “the white hat” (“le chapeau blanc”); FIG. 19 c shows the text “the car” (“la voiture”); FIG. 19 d shows the text “the hat” (“le chapeau”); FIG. 19 e shows the text “the cat” (“le chat”); and FIG. 19 f shows the text “the mouse” (“la souris”).

On the basis of only these example texts, the translation system described above would be unable to translate phrases such as “the white mouse” or “the white cat”.

Referring to FIG. 20, in a step 702, the development program 220 reads the translation units stored in the store 232 to locate analogous units. To determine whether two translation units are analogous, the source and target daughter lists are compared. If the number of daughters is the same in the source lists and in the target lists of a pair of translation units, and the literal daughters match, then the two translation units are temporarily stored together as being analogous.

After performing step 702, there will therefore be temporarily stored a number of sets of analogous translation units. Referring to the translation examples in FIGS. 19 a-f, the unit shown in FIG. 19 d will be found to be analogous to that of FIG. 19 e and the unit shown in FIG. 19 c is analogous to that shown in FIG. 19 f. Although the source sides of all four are equivalent (because the definite article in English does not have masculine and feminine versions), the two pairs are not equivalent in their target daughter lists.

For each pair of analogous translation units that were identified which differ in their source and target headwords, a third translation unit is located in step 704 which has the same source-target head pair as one of the analogous pair, but different daughters. For example, in relation to the pair formed by FIGS. 19 d and 19 e, FIG. 19 b would be selected in step 704 since it has the same heads as the unit of FIG. 19 d.

In step 706, a new translation unit record is created which takes the source and target heads of the second analogous unit (in other words, not the heads of the third translation unit), combined with the list of daughters of the third translation unit. In this case, the translation unit generated in step 706 for the pair of units of FIGS. 19 d and 19 e using the unit of FIG. 19 b would be:

-   SH7=Cat
-   SD1=The
-   SD2=White
-   TH7=Chat
-   TD1=Le
-   TD2=Blanc

Similarly, the new translation unit formed from the analogous pair of FIGS. 19 c and 19 f using the translation unit of FIG. 19 a would be as follows:

-   SH8=Mouse
-   SD1=The
-   SD2=White
-   TH8=Souris
-   TD1=La
-   TD2=Blanche

Accordingly, the translation development program 220 is able to generate new translation examples, many of which will be syntactically correct in the source and target languages.
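Purely as an illustrative sketch, steps 702 to 706 may be expressed in Python for the six examples of FIG. 19. The tuple layout (source head, source daughters, target head, target daughters) and the function name `generalise` are assumptions of this sketch; all daughters here are literal, and target daughters are listed without their left/right positions.

```python
from itertools import permutations

# Hypothetical flat encoding of the units of FIGS. 19 a to 19 f.
units = [
    ("car", ("the", "white"), "voiture", ("la", "blanche")),
    ("hat", ("the", "white"), "chapeau", ("le", "blanc")),
    ("car", ("the",), "voiture", ("la",)),
    ("hat", ("the",), "chapeau", ("le",)),
    ("cat", ("the",), "chat", ("le",)),
    ("mouse", ("the",), "souris", ("la",)),
]


def generalise(units):
    """Create pseudo translation units from analogous pairs (steps 702-706)."""
    known = set(units)
    pseudo_units = []
    for first, second in permutations(units, 2):
        # Step 702: analogous units have matching source and target daughters.
        analogous = first[1] == second[1] and first[3] == second[3]
        heads_differ = (first[0], first[2]) != (second[0], second[2])
        if not (analogous and heads_differ):
            continue
        for third in units:
            # Step 704: same head pair as the first unit, different daughters.
            same_heads = (third[0], third[2]) == (first[0], first[2])
            if same_heads and (third[1], third[3]) != (first[1], first[3]):
                # Step 706: heads of the second unit, daughters of the third.
                pseudo = (second[0], third[1], second[2], third[3])
                if pseudo not in known:
                    known.add(pseudo)
                    pseudo_units.append(pseudo)
    return pseudo_units


pseudo_units = generalise(units)
```

For these inputs the sketch yields the two units listed above, namely “the white cat” (“le chat blanc”) and “the white mouse” (“la souris blanche”).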

In the above examples, it will be seen that leaving the function words, such as determiners (“the”, “le”, “la”), as literal strings in the source and target texts of the examples, rather than marking them up as translation units, has the benefit of preventing over-generalisation (e.g. ignoring adjective-noun agreements).

Although the embodiment as described above functions effectively, it would also be possible in this embodiment to make use of the source and target language lexicons 234, 236 to limit the number of pairs which are selected as analogous.

For example, pairs might be considered analogous only where the source head words, and likewise the target head words, of the two units are in the same syntactic category. Additionally or alternatively, the choice of third unit might be made conditional on the daughters of the third unit belonging to the same syntactic category or categories as the daughters of the first and second units. This is likely to reduce the number of erroneous generalised pairs produced without greatly reducing the number of useful generalisations.

Where the generalisation of the above described embodiment is employed with the first embodiment, it is employed after the processes described in FIG. 7.

THIRD EMBODIMENT Creating and Using Head/Daughter Restrictions

If, as described in the first embodiment, any daughter may select any head during translation, many incorrect translations will be produced (in addition to any correct translations which may be produced). If the generalisation process described in the preceding embodiments is employed, this likelihood is further increased. If a number of translations would be produced, it is desirable to eliminate those which are not linguistically sound, or which produce linguistically incorrect target text.

A translation system cannot guarantee that the source text itself is grammatical, and so the aim is not to produce a system which refuses to generate ungrammatical target text, but rather one which, given multiple possible translation outputs, will yield the more grammatically correct, and faithful, one.

The system of the present embodiments does not, however, have access to syntactic or semantic information specifying which heads should combine with which daughters. The aim of the present embodiment is to acquire data to perform a similar function by generalising the combinations of units which were present, and more specifically, those which cannot have been present, in the example texts.

Accordingly, in this embodiment, the data generated by the development program 220 described above from the marked up source and target translation text is further processed to introduce restrictions on the combinations of head and daughter words which can be applied as candidates during the translation process.

The starting point is the set of translation pairs that were used to produce the translation units (with, possibly, the addition of new pairs also).

Inferring Restrictions

Accordingly, in this embodiment, restrictions are developed by the development program 220. Where the generalisation process of the preceding embodiments is used, then this embodiment is performed after the generalisation process. Additionally, the translation units produced by generalisation are marked by storing a generalisation flag with the translation unit record.

Referring to FIG. 21, in a step 802 the development program 220 causes the translator program 230 to execute on the source and the target language sample texts stored in the files 224, 226.

Where the translation apparatus is intended to operate only unidirectionally (that is, from the source language to the target language) it will only be necessary to operate on the source language (for example) texts; in the following, this will be discussed, but it will be apparent that in a bidirectional translation system as in this embodiment, the process is also performed in the other direction.

In step 804, one of the translations (there are likely to be several competing translations for each sentence) is selected and is compared with all of the target text examples. If the source-target text pair produced by the translation system during an analysis operation appears in any of the examples (step 808), that analysis is added to a “correct” list (step 810). If not, it is added to an “incorrect” list (step 812).

If the last translation has not yet been processed (step 814), the next is selected in step 804. The process is then repeated for all translations of all source text examples.

The goal of the next stage is to eliminate the incorrect analyses of the example texts.

Accordingly, referring to FIG. 22, each incorrect analysis from the list produced by the process of FIG. 21 is selected (step 822), and in step 824, the source analysis surface structure graph (tree) and the source analysis dependency structure are traversed to produce separate lists of the pairs of heads and daughters found within the structure. The result is a list of surface head/daughter pairs and a list of dependent head/daughter pairs. The two lists will be different in general since, as noted above, the surface and dependent daughters are not identical for many translation units.
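As a minimal sketch of the traversal of step 824, assuming (for illustration only) that a structure is held as a nested `(head, [subtrees])` tuple, the head/daughter pairs may be collected recursively as follows. The function name `head_daughter_pairs` is hypothetical, and the same routine would be run once on the surface tree and once on the dependency structure.

```python
def head_daughter_pairs(tree):
    """Return every (head, daughter) pair found in a (head, [subtrees]) tree."""
    head, daughters = tree
    pairs = []
    for daughter in daughters:
        pairs.append((head, daughter[0]))            # pair at this level
        pairs.extend(head_daughter_pairs(daughter))  # pairs further down
    return pairs


# Hypothetical surface tree for "I know the cat".
surface_tree = ("know", [("I", []), ("cat", [("the", [])])])
surface_pairs = head_daughter_pairs(surface_tree)
```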

This process is repeated for each analysis until the last is finished (step 826).

Having compiled surface and dependent head/daughter pair sets for each incorrect analysis, in step 828, a subset of head/daughter pairs is selected, so as to be the smallest set which, if disabled, would remove the largest number (preferably all) of incorrect analyses.

It will be recalled that when the original graphs were separated into translation components, the identities of the components occupying the daughter positions were stored for each. So as to avoid eliminating any of the head/daughter pairs which actually existed in the annotated source-target examples, these original combinations are removed from the pair lists.

The process of finding the smallest subset of head/daughter pairs to be disabled which would eliminate the maximum number (i.e. all) of the incorrect analyses is performed by an optimisation program, iteratively determining the effects of those of the head/daughter pairs which were not in the original examples.

It could, for example, be performed by selecting the head/daughter pair which occurs in the largest number of incorrect translations and eliminating that; then, of the remaining translations, continuing by selecting the head/daughter pair which occurs in the largest number and eliminating that; and so on, or, in some cases, a “brute force” optimisation approach could be used.
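The iterative selection just described may be sketched as a greedy procedure; the following Python is a sketch under stated assumptions rather than the optimisation program itself. Each incorrect analysis is assumed to be given as a set of head/daughter pairs, the function name `pairs_to_disable` is hypothetical, and ties are broken alphabetically only to make the result repeatable.

```python
from collections import Counter


def pairs_to_disable(incorrect_analyses, original_pairs):
    """Greedily pick head/daughter pairs whose removal kills incorrect analyses."""
    protected = set(original_pairs)
    # Pairs that occurred in the annotated examples must never be disabled.
    remaining = [set(a) - protected for a in incorrect_analyses]
    remaining = [a for a in remaining if a]  # analyses with no removable pair are left alone
    disabled = []
    while remaining:
        counts = Counter(pair for analysis in remaining for pair in analysis)
        best = max(sorted(counts), key=counts.__getitem__)  # most frequent pair
        disabled.append(best)
        remaining = [a for a in remaining if best not in a]
    return disabled


# Hypothetical pair sets for three incorrect analyses.
incorrect = [
    {("saw", "cat"), ("cat", "the")},
    {("saw", "cat"), ("think", "john")},
    {("know", "mary")},
]
disabled = pairs_to_disable(incorrect, original_pairs={("cat", "the")})
```

A greedy choice of this kind is not guaranteed to find the smallest possible subset, which is one reason a “brute force” search might be preferred for small pair lists.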

The product of this step is therefore a pair of lists (one for the surface representation and one for the dependency representation) of pairs of head words and daughter words which cannot be combined. Generally, there is a pair of lists for each of the source and target sides.

Thus, these pairs could, at this stage, be stored for subsequent use in translation so that during the analysis phase of translation, the respective combinations are not attempted, thus reducing the time taken to analyse by reducing the number of possible alternative analyses, and eliminating incorrect analyses.

Having found and marked the pairs as illegal in step 830, however, it is then preferred to generalise these restrictions on head/daughter pairing to be able to select between competing analyses for, as yet, unseen source texts beyond those stored in the example files 224.

To do this, a principle is required which is capable of selecting the “best” generalisation from amongst all those which are possible. According to this embodiment, the preferred generalisation is that which is simplest (in some sense) and which remains consistent with the example data.

This is achieved as follows. A data structure is associated with each translation unit and each aligned daughter; in this embodiment, it is an attribute-value matrix (as is often used to characterise linguistic terms) although other structures could be used.

An aligned daughter may only dominate a translation unit if the associated data structures “match” in some sense (tested for example by PROLOG unifications).

The restrictions are generalised by choosing to minimise the number of distinct attribute-value matrices required to produce translations which are consistent with the original translation examples. A daughter can only select a particular head during translation if the head and daughter attribute-value matrices can be matched.

Initially, from the list of illegal head/daughter pairings produced by the process described above, it is known from the example data that some heads cannot combine with some daughters. However, because the example data is incomplete, it is likely that for each such head, there are also other daughters with which it cannot combine which happen not to have been represented in the example texts (similarly, for each daughter there are likely to be other heads with which that daughter cannot combine).

In the following process, therefore, the principle followed is that where a first head cannot combine with a first set of daughters, and a second head cannot combine with a second set of daughters, and there is a high degree of overlap between the two lists of daughters, then the two heads are likely to behave alike linguistically, and accordingly, it is appropriate to prevent each from combining with all of the daughters with which the other cannot combine.

Exactly the same is true for the sets of heads with which each daughter cannot combine. The effect is thus to coerce similar heads into behaving identically and similar daughters into behaving identically, thus reducing the number of different behaviours, and generalising behaviours from a limited set of translation examples.

Referring to FIG. 23 a, in step 832, a first head within the set of illegal head/daughter pairs is located (the process is performed for each of the surface and dependency sets, but only one process will here be described for clarity). The daughters which occur with all other instances of that head in the set are collected into a set of illegal daughters for that head (step 834).

When (step 836) the operation has been repeated for each distinct head in the set, then in step 842, a first daughter is selected from the set of illegal pairs, and (similarly) the different heads occurring with all instances of that daughter in the set of pairs are compiled into a set of illegal heads for that daughter (step 844). When all daughter and head sets have been compiled (both for the surface and for the dependency lists of pairs) (step 846) the process passes to step 852 of FIG. 23 b.

In step 852, the set of heads (each with a set of daughters with which it cannot combine) is partitioned into a number of subsets. All heads with identical daughter sets are grouped and stored together to form a subset. The result is a number of subsets corresponding to the number of different behaviours of heads.

In step 854, the same process is repeated for the set of daughters, so as to partition the daughters into groups having identical sets of heads.

Next, in step 856, it is determined whether all the head and daughter subsets are sufficiently dissimilar to each other yet. For example, they may be deemed dissimilar if no subset has any daughter in common with another. Where this is the case (step 856), the process finishes.

Otherwise, the two subsets of heads with the most similar daughter sets (i.e. the largest number of daughters in common, or the largest intersection) are found (step 857). Similarly, in step 858, the two most similar subsets of daughters (measured by the number of heads they have in common) are found.

In step 859 it is tested whether the merger of the two head sets, and the two daughter sets, would be allowable. It is allowable unless the merger would have the effect of making illegal a combination of head and daughter that occurred in the example texts (and hence disabling a valid translation). If unallowable, the next most similar sets are located (steps 857, 858).

If the merger is allowable, then (step 860) the two head sets are merged, and the daughter sets of all heads of the merged subset become the union of the daughter sets of the two previous subsets (that is, each head inherits all daughters from both subsets). Similarly, the two daughter sets are merged, and the head sets for each daughter become the union of the two previous head sets.

The process then returns to step 856, until the resulting subsets are orthogonal (that is, share no common members within their lists). At this point, the process finishes, and the resulting subsets are combined to generate a final set of head/daughter pairs which cannot be combined in translation.
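The partition-and-merge loop of steps 852 to 860 may be sketched in Python for the head side only (the daughter side being symmetrical). This is a sketch under stated assumptions: the function name `generalise_restrictions` and the set-based data layout are hypothetical, and the loop also stops when the only remaining overlapping subsets cannot allowably be merged.

```python
def generalise_restrictions(illegal_pairs, attested_pairs):
    """Merge heads with overlapping illegal-daughter sets (steps 852-860)."""
    by_head = {}
    for head, daughter in illegal_pairs:
        by_head.setdefault(head, set()).add(daughter)
    # Step 852: heads with identical daughter sets form one subset.
    groups = {}
    for head, daughters in by_head.items():
        groups.setdefault(frozenset(daughters), set()).add(head)
    subsets = [(heads, set(daughters)) for daughters, heads in groups.items()]
    while True:
        # Step 857: rank subset pairs by the size of their daughter overlap.
        candidates = []
        for i in range(len(subsets)):
            for j in range(i + 1, len(subsets)):
                overlap = len(subsets[i][1] & subsets[j][1])
                if overlap:
                    candidates.append((overlap, i, j))
        merged = False
        for _, i, j in sorted(candidates, key=lambda c: -c[0]):
            heads = subsets[i][0] | subsets[j][0]
            daughters = subsets[i][1] | subsets[j][1]
            # Step 859: never outlaw a combination seen in the examples.
            if any((h, d) in attested_pairs for h in heads for d in daughters):
                continue
            # Step 860: merge; every head inherits the union of daughters.
            subsets = [s for k, s in enumerate(subsets) if k not in (i, j)]
            subsets.append((heads, daughters))
            merged = True
            break
        if not merged:  # step 856: orthogonal, or no allowable merger left
            break
    return {(h, d) for heads, daughters in subsets for h in heads for d in daughters}


# Hypothetical illegal pairs: h1 and h2 share the illegal daughter d2.
illegal = {("h1", "d1"), ("h1", "d2"), ("h2", "d2"), ("h2", "d3"), ("h3", "d9")}
generalised = generalise_restrictions(illegal, attested_pairs=set())
blocked = generalise_restrictions(illegal, attested_pairs={("h1", "d3")})
```

In the first call the heads h1 and h2 are merged, so each becomes barred from d1, d2 and d3; in the second call the merger is refused because it would outlaw the attested pair (h1, d3).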

This is then stored within the rules database 232, and applied during subsequent translations to restrict the heads selected to unite with each daughter during analysis. As mentioned above, separate sets are maintained for the surface representation and for the dependency representation.

Thus, this embodiment, like the last, simplifies and generalises the behaviours exhibited by translation components. While the preceding generalisation embodiment operated to expand the range of possible translation units, the present embodiment operates to restrict the range of legal translations which can be produced by generalising restrictions on translation unit combinations.

Automatic Alignment and Generation of New Translation Units from New Sample Translations

In this embodiment, the invention is arranged to provide new translation units partly or completely automatically.

When a translator provides a new translation, the original text in the source language and the translated text in the target language form a source-target pair from which new translation units can be generated. This pair is input into the translation system for processing by the translation development program.

In this embodiment, as in those described above, a human user (who may or may not be the translator) can mark up the source language text and the target language text to indicate dependencies, and can then mark up alignments between the source language text and the target language text (i.e. pairs of words which are translations of each other).

In this embodiment, one or both of these steps is automated. If the human user (or one user in the source language and another in the target language) has already marked up the dependencies in the source and target language text, then this information may be used and the present embodiment can proceed to step 2006.

If not, then in step 2002, the translation development program performs a translation on the source language text, sentence by sentence, to generate one or more target texts, and compares them with the input target language text. If one of the translations matches the actual text, there is no need to proceed further, since the existing stored translation units can translate the text.

If not, then in step 2004, the translation development program performs a translation on the input target language text. Thus, at this stage, for each sentence in the source language text and corresponding sentence in the target language text, there are one or more source language analyses and one or more target language analyses, built using the existing stored translation units, but no match between them.

Each analysis includes the identification of a root node of the sentence (or the principal root where there is more than one), and a dependency structure relating each other word in the sentence directly or indirectly to the root node. In general, there may be several analyses, and the “correct” one is not known from the outset.

Next, for each sentence, in step 2006, the translation development program selects a first pair of analyses (i.e. a first source language analysis and a first target language analysis), and selects a first source word within the source analysis in step 2008.

In step 2010, as will be described in greater detail with reference to FIG. 25, the translation development program calculates part of a matrix relating that source word to each of the words in the target analysis, to indicate the strength of correspondence between the source word and each of the words in the target analysis (ideally, the matrix would indicate a strong likelihood that some of the source words each correspond to one, and only one, word in the target analysis).

Indicating the i words of the source text as s₁, s₂, s₃, . . . s_(i), the j words of the target text as t₁, t₂, t₃, . . . t_(j), and the likelihood that the jth target word is a translation of the ith source word as s_(i)t_(j), then the matrix is as follows:

$$\text{SOURCE}\;\overset{\text{TARGET}}{\begin{pmatrix}
s_{1}t_{1} & s_{1}t_{2} & s_{1}t_{3} & \cdots & s_{1}t_{j} \\
s_{2}t_{1} & s_{2}t_{2} & s_{2}t_{3} & \cdots & s_{2}t_{j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{i}t_{1} & s_{i}t_{2} & s_{i}t_{3} & \cdots & s_{i}t_{j}
\end{pmatrix}}$$

Instead of the above “likelihood”, semantic similarity or another such measure of affinity may be used.

In step 2012, the next source word is selected and the matrix calculation step is repeated until all of the source words have been processed.

Next, in step 2013, a score is calculated for that pair using the alignment matrix, as will be described in greater detail below with reference to FIG. 26.

Next, the next pair of source and target analyses is selected (step 2014) until all possible combinations of source analysis and target analysis have been processed.

Next, in step 2014 the highest scoring pair of analyses and alignment arrangements within that pair are jointly selected.

At this stage, the new translation texts are marked up in the same way as shown in FIG. 6, and are ready for the processing of FIG. 7 onwards, to perform the “relative clause” transform and the “topic shift” transform and then to generate new translation units (step 2018) and store them (step 2020) for use in subsequent translations.

Referring to FIG. 25, comprising FIGS. 25 a-25 c, the process performed in step 2010 for each source word consists of: selecting a first target word (step 2022); calculating a score (step 2024, described in greater detail in relation to FIGS. 25 b and 25 c) indicating how closely that word relates to the source word; and adding the score as a new entry to the matrix to indicate the relation between the source word and the target word.

Finally, in step 2028, the next target word is selected and the process is repeated until all are done.

Referring to FIG. 25 b, the process of calculating a score will now be described in greater detail.

First, in step 2032, the existing stored translation unit records are searched to identify whether the source word and target word already exist as an aligned pair in a translation unit. If so, there is a strong possibility that the target word represents a translation of the source word in the new text. A first variable SCORE1 is allocated (step 2034) a value of either zero (in step 2038), if there is no existing translation unit in which the source and target words exist as an aligned pair, or a (in step 2036), if one or more such translation units do exist. The value a may be a constant, or it may have a value which depends upon the ratio of the number of translation units in which the source and target words exist as an aligned pair to the total number of translation units in which either one exists in alignment with any other words.

In step 2040, the target word is looked up in the target lexicon database 236, to determine whether it is listed as a translation of the source word (from the source lexicon 234). If so (step 2042) then the value of a variable SCORE2 is set to a value b; if not, it is set to zero (step 2044).

The value b is lower than the value a, since the presence of the word of the translation in the lexical database is a less certain indicator than its presence in previously marked up translations (recorded in the existing stored translation units).

Finally, referring to FIG. 25 c, in step 2048, the translation development program performs semantic analysis on the source and target analyses, to determine (step 2050) whether the target word appears semantically similar to the source word (for example, in that both represent an entity, or both represent an action; and in that both stand in the same relation to other entities or actions). If not, the value of a variable SCORE3 is set to zero; if so (step 2052), the value of SCORE3 is set to c, where c is considerably smaller than either a or b since the semantic analysis is expected to be less reliable than either of the previous two tests.

Finally, in step 2056, a SCORE is calculated as SCORE1+SCORE2+SCORE3. The SCORE indicates, on the totality of the evidence available, the probability that the target word is a translation of the source word. In many cases, the score will be zero. However, since the target text is a genuine translation of the source text, there should be at least one non-zero score for some source words.
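By way of a hedged sketch, the score of step 2056 and the matrix of step 2010 may be computed as follows. The weights a=4, b=2 and c=1 are illustrative values chosen only so that a > b > c; the function names, the set of previously aligned pairs, the dictionary lexicon and the `similar` callback are all assumptions of the sketch.

```python
def alignment_score(src, tgt, aligned_pairs, lexicon, similar, a=4, b=2, c=1):
    """SCORE = SCORE1 + SCORE2 + SCORE3 for one source/target word pair."""
    score1 = a if (src, tgt) in aligned_pairs else 0  # seen aligned in a stored unit
    score2 = b if tgt in lexicon.get(src, ()) else 0  # listed as a lexicon translation
    score3 = c if similar(src, tgt) else 0            # judged semantically similar
    return score1 + score2 + score3


def score_matrix(source_words, target_words, aligned_pairs, lexicon, similar):
    """One row per source word, one column per target word."""
    return [
        [alignment_score(s, t, aligned_pairs, lexicon, similar) for t in target_words]
        for s in source_words
    ]


# Hypothetical evidence for "the cat" / "le chat".
aligned_pairs = {("cat", "chat")}
lexicon = {"the": {"le", "la"}, "cat": {"chat"}}
matrix = score_matrix(
    ["the", "cat"], ["le", "chat"], aligned_pairs, lexicon, similar=lambda s, t: False
)
```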

It may be preferable in actual embodiments to vary the above order of operations, since the operations performed in FIG. 25 b may not need to be repeated for each pair of source/target analyses.

Referring to FIG. 26, the process of step 2014 of FIG. 24 will now be described in greater detail. This process is intended jointly to select the source/target analysis pair and the source/target word alignment pair which appear best to represent the translation.

Referring to FIG. 26, the process performed in step 2013 of FIG. 24 is as follows.

In step 2064, the root word of the source analysis and the root word of the target analysis are selected, and an alignment record representing a link between them is stored in step 2066.

Next, in step 2068, an isomorphism test is performed. In order to be able to decompose the aligned source and target analyses into translation units, only those alignments which satisfy the isomorphism test need be considered.

Specifically, if the source analysis causes a first source word to dominate a second source word, and if the first source word is aligned with a first target word, and the second source word is aligned with a second target word, then the first target word must dominate the second target word in the target analysis. If it does not, then it will not be possible to decompose the source and target language texts into translation units which can be used for translation as described above. Thus, no alignment which has this result should be permitted.
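The dominance condition just stated can be illustrated with a short Python sketch. It assumes, purely for illustration, that each analysis is held as a dictionary mapping every word to its parent (the root mapping to None) and that "dominates" means "is an ancestor of"; the function names are hypothetical.

```python
def dominates(parent, ancestor, node):
    """True if `ancestor` lies on the path from `node` up to the root."""
    current = parent.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = parent.get(current)
    return False


def violates_isomorphism(candidate, alignments, src_parent, tgt_parent):
    """True if adding `candidate` would break the dominance condition
    with respect to any alignment already stored."""
    s_new, t_new = candidate
    for s_old, t_old in alignments:
        # Source dominance must be mirrored by target dominance.
        if dominates(src_parent, s_old, s_new) and not dominates(tgt_parent, t_old, t_new):
            return True
        if dominates(src_parent, s_new, s_old) and not dominates(tgt_parent, t_new, t_old):
            return True
    return False


# Illustrative analyses only: "eats" and "mange" are the roots.
src_parent = {"eats": None, "cat": "eats", "fish": "eats"}
tgt_parent = {"mange": None, "chat": "mange", "poisson": "mange"}
```

For example, once the roots "eats" and "mange" are aligned, the pair ("cat", "chat") passes the test, whereas ("cat", "mange") fails it because "mange" does not dominate itself.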

Accordingly, in step 2068, the matrix of source/target alignment scores calculated as described above is reviewed, and any potential source/target alignments which would violate the isomorphism test are eliminated, by setting their score values to zero.

Of the remaining possible non-zero alignments, the source/target word pair with the highest remaining score is next selected in step 2070, and steps 2066 and 2068 are repeated, until there are no remaining non-zero scores in the matrix.
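The loop through steps 2066, 2068 and 2070 might be sketched as below. This is a sketch under stated assumptions rather than the described implementation: the score matrix is assumed to be a dictionary keyed by (source word, target word), and the test applied in step 2068 is passed in as a hypothetical callable `conflicts(candidate, alignments)`.

```python
def greedy_align(scores, root_pair, conflicts):
    """Align the roots first, then repeatedly take the best surviving pair."""
    remaining = dict(scores)      # work on a copy of the score matrix
    alignments = []
    pair = root_pair              # step 2064: start from the root words
    while pair is not None:
        alignments.append(pair)   # step 2066: store the alignment record
        remaining.pop(pair, None)
        # Step 2068: zero every candidate that conflicts with what is stored.
        for candidate in list(remaining):
            if conflicts(candidate, alignments):
                remaining[candidate] = 0
        # Step 2070: pick the highest remaining non-zero score, if any.
        nonzero = {p: s for p, s in remaining.items() if s > 0}
        pair = max(nonzero, key=nonzero.get) if nonzero else None
    return alignments


def one_to_one(candidate, alignments):
    """Assumed stand-in test for the demonstration: a word may be aligned once."""
    return any(candidate[0] == s or candidate[1] == t for s, t in alignments)


# Illustrative scores only.
scores = {("a", "x"): 5, ("a", "y"): 3, ("b", "y"): 4, ("b", "x"): 2}
result = greedy_align(scores, ("r", "R"), one_to_one)
```

In the demonstration a simple one-to-one rule stands in for the isomorphism test, so that the elimination of conflicting candidates is visible on a small matrix.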

In step 2074, a total score is calculated for the analysis pair and alignment; for example, by adding the scores of each aligned pair of words. Thus, the total score will depend both on the number of words which were successfully aligned in the analysis, and on the scores for each of the words thus aligned. Additionally, where the analysis generated information on the likelihood that it is correct in the source language and/or the target language, the summed scores may be added to, or multiplied by, this source and target analysis information.
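A minimal sketch of the total-score calculation follows, taking the multiplicative option for the analysis likelihoods. The function names, the default likelihood of 1.0 and the tuple layout of each candidate are assumptions made for illustration.

```python
def total_score(pair_scores, source_likelihood=1.0, target_likelihood=1.0):
    """Sum the scores of the aligned word pairs and weight the sum by the
    (assumed) likelihoods of the source and target analyses."""
    return sum(pair_scores) * source_likelihood * target_likelihood


def best_candidate(candidates):
    """Pick the analysis pair and alignment with the highest total score.

    Each candidate is a hypothetical tuple:
    (label, pair_scores, source_likelihood, target_likelihood).
    """
    return max(candidates, key=lambda c: total_score(c[1], c[2], c[3]))


# Illustrative candidates only.
candidates = [
    ("analysis pair 1", [16, 5], 0.5, 1.0),   # total 10.5
    ("analysis pair 2", [6, 6, 1], 1.0, 1.0), # total 13.0
]
```

Here the second candidate is preferred even though its best single pair scores lower, because more words were aligned and its analyses carry a higher likelihood.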

Thus, it will be seen that for each analysis, proceeding from the root nodes, alignments are selected in order of probability that the alignment is correct, and conflicting alignments are then eliminated.

Thus, after performing the process of FIG. 26, each source/target analysis pair includes a number of aligned words (at least one alignment is present because the root words are always aligned).

As in the above described embodiments, it may be desirable to prevent absolutely every possible translation from being aligned. Accordingly, scores may be set to zero under some particular circumstances even where the words are translatable; for example, where the word is both very common and has no further words dependent upon it in the analysis.

Although the analyses in the above embodiments were produced using the existing translation units, it might be possible to apply syntactic and semantic analysis to analyse the text; any suitable process which produces a structured graph which can be converted into a tree-structure of words can be used.

Conclusion

The present invention in its various embodiments provides a translation system which does not require manually written linguistic rules, but instead is capable of learning translation rules from a set of examples which are marked up using a user interface by a human. The marked up examples are then pre-processed to generalise the translation, and to restrict the number of ungrammatical translation alternatives which could otherwise be produced.

The restriction and generalisation examples both rely on the principle of using the simplest models which are consistent with the example data.

The form employed results in translation units which resemble normal grammar or logic rules to the point where a simple parser, combined with the unification features of the PROLOG language or similar languages, can perform translation directly.

Embodiments of the invention may be used separately, but are preferably used together.

Whilst apparatus which comprises both a development program 220 and a translation program 230 has been described, it will be clear that the two could be provided as separate apparatus, the development apparatus developing translation data which can subsequently be used in multiple different translation apparatus. Whilst apparatus has been described, it will be apparent that the program is readily implemented by providing a disc containing a program to perform the development process, and/or a disc containing a program to perform the translation process. The latter may be supplied separately from the translation data, and the translation data may itself be supplied as a data structure on a record carrier such as a disc. Alternatively, programs and data may be supplied electronically, for example by downloading from a web server via the Internet.

Conveniently the present invention is provided for use together with a translation memory of translation jobs performed by a translator, so as to be capable of using the files in the memory for developing translation data.

It may be desirable to provide a linguistic pre- and post-processor program arranged to detect proper names, numbers and dates in the source text, and transfer them correctly to the target text.

Whilst the present invention has been described in application to machine translation, other uses in natural language processing are not excluded; for example in checking the grammaticality of source text, or in providing natural language input to a computer. Whilst text input and output have been described, it would be straightforward to provide the translation apparatus with speech-to-text and/or text-to-speech interfaces to allow speech input and/or output of text.

Whilst particular embodiments have been described, it will be clear that many other variations and modifications may be made. The present invention extends to any and all such variations, modifications and substitutions which would be apparent to the skilled reader, whether or not covered by the appended claims. For the avoidance of doubt, protection is sought for any and all novel subject matter and combinations thereof.

1. A computer natural language translation system, comprising: means for inputting source language text; means for outputting target language text; and transfer means for generating said target language text from said source language text using stored translation data generated from examples of source and corresponding target language texts, the transfer means being arranged to use data defining a plurality of stored translation units each consisting of a small number of ordered words and/or variables in both the source and the target language, and development means for inputting new examples of source and corresponding target language texts, and adding new translation units based thereon, the development means being arranged: to apply said stored translation data to a new example of source and corresponding target language texts, to generate for each at least one analysis comprising analysis data indicating the dependencies of words therein; to calculate, for each one of a plurality of source words in the source language text, a measure of affinity between each word in the target language text and each such source language word; to pair source language words with target language words on the basis of the measures thus calculated, and to form new translation units comprising a said paired word and those words and/or variables in both the source and the target language analyses which depend upon it.
 2. A system according to claim 1, in which the development means is arranged to be capable of generating a plurality of said analyses in at least one of the source and target language, and to select one pair of analyses from which to form said new translation units.
 3. A system according to claim 2, in which the development means is arranged to jointly select the pair of analyses and the pairing of said source and target words.
 4. A system according to claim 1, in which said analysis data represents, or can be converted into, a tree structure indicating the dependencies of words therein.
 5. A system according to claim 1, in which the development means is arranged to perform said analyses using the stored translation units.
 6. A system according to claim 1, in which the development means is arranged to calculate said measures of affinity using the stored translation units.
 7. A system according to claim 1, in which the development means is arranged to calculate said measures of affinity using a lexicon database through which translations in said source and target languages can be identified.
 8. A system according to claim 1, in which the development means is arranged to calculate said measures of affinity using semantic and/or syntactic analysis.
 9. A system according to claim 1, wherein the measure of affinity is a measure of the probability that each word in the target language text is a translation of each respective source language word.
 10. A system according to claim 1, in which the development means is arranged to perform said pairing in order of probability of correspondence from the highest probability, using said measures of probability.
 11. A system according to claim 10, in which, after each said pairing, the development means is arranged to perform a word order analysis and to reject future pairings which would violate a word order criterion.
 12. A method of obtaining new translation units for a computer translation system, from examples of source and corresponding target language texts, comprising: analysing the texts to obtain dependency relationships between language units thereof; matching words of one text against all those of the other, to generate scores; pairing words of the respective texts using said scores; and providing new translation units using the paired words, and language units in each of the languages derived from the analyses.
 13. A computer natural language translation system, comprising: means for inputting source language text; means for outputting target language text; transfer means for generating said target language text from said source language text using stored translation data generated from examples of source and corresponding target language texts, characterised in that said stored translation data comprises a plurality of translation components, each comprising: surface data representative of the order of occurrence of language units in said component; dependency data related to the semantic relationship between language units in said component; and the dependency data of language components of said source language being aligned with corresponding dependency data of language components of said target language, and in that said transfer means is arranged to use said surface data of said source language in analysing the source language text, and said surface data of said target language in generating said target language text, and said dependency data in transforming the analysis of said source text into an analysis for said target language.
 14. A computer language translation development system, for developing data for use in translation, comprising: means for allowing corresponding source and target example texts to be linked into source and target language dependency graphs; means for allowing corresponding translatable nodes of said source and target language dependency graphs representing translatable parts of the source and target texts to be aligned; and means for automatically generating, from said source and target language dependency graphs, respective associated surface representative graph having a tree structure.
 15. A computer program comprising code to execute on a computer to cause said computer to act as the system of claim 1.
 16. Apparatus for inferring new translation units which will allow a given source text to translate as a given target text comprising, a database of translation units; means arranged to analyse both the source text and the target text into one or more alternative representations using these units; means arranged to indicate and score lexical alignments between the source and target texts; means arranged to select one of the alternative source analyses and one of the alternative target analyses based on the scored alignments; and means arranged to infer one or more translation units based on the selected source analysis, the target analysis and the alignment.
 17. Apparatus according to claim 16 wherein said alternative representations are tree representations or representations that can be converted into tree representations.